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T-H Abstract 

o 

^Sj Preserving the privacy of individual databases when carrying out statistical calculations has 

^ a long history in statistics and had been the focus of much recent attention in machine learning 

Q In this paper, we present a protocol for computing logistic regression when the data are held by 

separate parties without actually combining information sources by exploiting results from the 
literature on multi-party secure computation. We provide only the final result of the calculation 
compared with other methods that share intermediate values and thus present an opportunity 
for compromise of values in the combined database. Our paper has two themes: (1) the de- 
I— I velopment of a secure protocol for computing the logistic parameters, and a demonstration of 

its performances in practice, and (2) and amended protocol that speeds up the computation of 
the logistic function. We illustrate the nature of the calculations and their accuracy using an 
extract of data from the Current Population Survey divided between two parties. 

Keywords: Distributed analysis; Logistic regression; Privacy-preserving computation; Se- 
cure multiparty computation. 

> 

1 Introduction 

(N 

Privacy concerns are becoming more and more acute, especially in the digitized world where new 
supercomputers with an increasing processing capacities appear almost every day. These new 
T-H machines together with impressive new technologies make the process of data collection, data 

storing and data analysis as easy as ever. This "ease of use," may be manipulated by untrustful 
^> elements, whose aim is to deliberately cause harm by, for example, identifying and exposing sensitive 

^ data. It is the goal of privacy preserving methods to prevent or at least lessen the chances of such 

^ harmful actions from happening. In this paper we present a novel way to achieve the goal when a 

certain statistical analysis is required. 

Preserving the privacy of individual databases when carrying out statistical calculations has a 
long history in statistics and had been the focus of much attention in machine learning, e.g., see [1]. 
Once data are merged across sources, however, privacy becomes far more complex and a number of 
privacy issues arise for the linked individual files that go well beyond those that are considered with 
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regard to the data within individual sources. When the goal is the production of the results of some 
statistical calculation, such as a regression analysis, c.f. Karr et al. [13, 14], we can often exploit 
results from the cryptography literature, borrowing tools such as secure multi-party computation, 
e.g., see [16, 20]. Secure multi-party protocols are concerned with distributed computation where 
each participating party, holding a private input, learns nothing but the result (see Section 2). 

This paper has two main themes. In both themes we conceptualize the existence of a single 
combined database containing all of the information for the individuals in the separate databases 
and for the union of the variables. We propose an approach that gives full statistical calculation 
on this combined database without actually combining information sources, see [15, 16]. We focus 
mainly on logistic regression, but our methods and tools are essentially adaptable to other statistical 
models, as indicated in Section 8. Our approach provides only the final result of the calculation 
compared with other methods that share intermediate values and thus present an opportunity for 
compromise of values in the combined database, c.f. [5, 7]. We remark that our problem differs 
from the one studied by Chaudhuri and Monteleoni [4] using differential privacy, since they are 
concerned with information leakage by the output of the computation, whereas we are concerned 
with leakage from the computation itself ! 

The first theme is the development of a novel approach to perform the calculations required for 
fitting logistic regression models when the data are distributed among several parties. In our settings 
the parties are unwilling or are simply forbidden (by law regulations) to share their respective data. 
They acknowledge the fact that pooling their private data into a conceptual global database, and 
running the logistic regression on the pooled data, rather than on their own data, can only lead 
to a better statistical analysis. We develop a secure protocol to compute the maximum likelihood 
estimates of the logistic parameters. Throughout the paper we make repeated use of what is known 
as random secret sharing, which enables us to keep intermediate parameter values secret. The first 
theme aims at performing the required calculations by using operations which are restricted to a 
linear algebra type. Note that the fitting process requires computing the logistic function which is 
highly non-linear. In principle, we may perform any computation securely, by making use of Yao's 
general protocol [21]. Nonetheless, this is in essence a theoretical construction which will often be 
inefficient for large computations [20]. Instead, we craft a specially designed approximation to the 
logistic function, which can be securely evaluated using the machinery of random shares and Yao's 
millionaire protocol. 

We establish the theoretical validity of the secure protocol for computing the logistic parameters, 
and show its performances in practice. In high dimensional problems with large number of cases our 
protocol may require faster computing resources. This is mainly because our approximation requires 
computing the predicate "greater-than," which may take many encryptions. Indeed, evaluating this 
predicate by a reduction to Yao's protocol takes roughly 0{b) encryptions where b is the number 
of bits used to represent the numbers (this becomes dauntingly large due to the secret sharing 
scheme) . 

This leads us to the second theme, which tries to amend the protocol is a way that speeds 
up the computation of the logistic function. The main idea here is to avoid special circuit sub- 
protocols, such as Yao's protocol. To that end, we show that we can perform the fitting process 
using only sums and products. The advantage to this is that these computations are very well 
studied primitives in secure multiparty computation and thus we can instantiate our method in a 
different secure multiparty computation scheme (e.g., [20, 9]), depending on the security demands 
of the data holders. We propose to approximate the vector of logistic function values, by repeatedly 
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Taylor expanding around the current value and stepping along the gradient. Operations other than 
sums and products, are not needed here. In principle, the approach represents the logistic function 
as the solution to a ordinary differential equation, then applies Euler's method to approximate 
the solution. As with the first theme, we show that we can make the approximation arbitrarily 
accurate, at the expense of computational efficiency, and we present an illustrative empirical result. 

We close the introduction with a brief description of logistic regression, mainly for the purpose 
of setting notation. Logistic regression is used for predicting binary outcomes or class membership 
given a set of explanatory variables or predictors. We can use the fitted model to predict class 
membership for a newly obtained record consisting of only the values of the predictors. The basic 
framework of logistic regression treats binary responses yi, . . . , y„ as realizations of n independent 
Bernoulli random variables, Yi, . . . ,Yn, whose mean depends on a set of predictors Xi G M*^, as 
follows: 

EYi = a{xjp) , (1) 

where a{a) = (1 + exp(— a))~^, is the sigmoid (or the logistic) function, and /? is a d-dimensional 
parameter vector. This makes the log odds, log(Eyi/(l — El^)), linear in the predictors. 

A standard method for computing the maximum likelihood estimates of /3 is Newton-Raphson's 
method, since closed form expressions do not exist. The fitting process requires the user to supply 
the log-likelihood function associated with logistic regression, along with its first two derivatives. 
Suppressing dependence on the data and vector of parameters, we let i be the log-likelihood, i.e., 
I = YliiiVi^J P ~ log(l + ^)}- We also put on record the first two derivatives: 

V^ = Y,{xiy^-Xia{xJp)] , V2^ = -^a(xf/3)(l-a(xf/3))xixf . (2) 

i i 

The gradient and the Hessian are assembled together to produce an estimate of the logistic 
parameters through the iterative process: 

I3t+i = /3i - {vH)-^v^ . (3) 

Our protocol will be structured in rounds, where each round corresponds to an iteration of Newton's 
method (3) followed by a convergence check. Each round involves a loop through all the cases Xi to 
compute the contribution to the gradient and Hessian. We keep intermediate values of j5t unshared 
between the parties. This is made possible by representing /3( as random shares (see Section 3). 

The remainder of the paper is organized as follows. Section 2 presents the multi-party setup. 
In section 3 we provide several sub-protocols which we will need. Sections 4 and 5 describe our 
protocol and an approach for speeding up the calculation involved, respectively. Section 6 describes 
implementation details. Section 7 illustrates aspects of the computation on an extract of data from 
the Current Population Survey divided between two parties. Section 8 discusses possible extensions. 
We defer all technical details to Appendices A and B. 

Setting 

We let X denote the nxd design matrix, and y the n-dimensional response vector. We assume the 
presence of P > 2 parties who are interested in computing logistic regression on the total of their 
data. We suppose that the union of the parties data corresponds to the X and y of the logistic 
regression. In particular, we suppose that party j holds onto the pair {Xj.yj) with Xj ^W"^ and 
yj S {0, 1}", where Xj is the j*^ party design matrix, and yj is her (binary) response vector. 
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In this work we consider a setting where each party has an "additive share" of the dataset. 
That is, Xj = X and yj = y where X and y correspond to the design matrix and response 
vector of the combined data on which the logistic regression is performed. This subsumes all the 
partitioning schemes for the database (e.g., vertical and horizontal partitioning which are the cases 
considered in [20]) as in these cases for each element, one party holds the value and the remaining 
parties hold zero. Furthermore this setup is applicable in a case where parties may have overlapping 
data, and the logistic regression is to be learned by using a linear function of the overlapping data 
(e.g., a weighted average) as a kind of measurement error model. We suppose that the union of the 
individual data sets gives the complete data. In cases where some data are missing, we can apply 
a privacy preserving imputation method such as in Jagannathan and Wright [12] as a preprocess, 
and then run our protocol. 

We note that our method is general in the sense that it is applicable to every partitioning 
scheme, but it is clearly possible to treat specific cases such as vertically partitioned data with 
more efficient specialized protocols. 

2 Secure Multi-Party Computation 

Ideally we would like our method to provide only the output of the calculation to the parties 
involved, and reveal nothing more. This is a lofty goal without the aid of trusted third parties, 
however it is relaxed in a useful way in the cryptographic literature. First it is assumed that 
the parties are not able to quickly solve computationally hard problems (such as breaking RSA 
encryption). Then, a protocol is secure so long as intermediate values in the computation either 
contain almost no information (in the sense that the protocol would have to be re-run astronomically 
many times on the same input data in order to detect any information in the messages), or will 
only reveal information as the result of an intractable computation. We now briefiy review the 
security model we intend to use. 

We consider the "functionality" (see [9]) which maps the data of each party into the logistic 
regression parameter vector /3: 

{{Xi,yi), {X2, 2/2), •• • {Xp, yp)} ^ • • • /?} (4) 

The right hand side represents P copies of the parameter, so that each party receives the same 
output. Note that each design matrix is of the same dimensions. 

A protocol for computing the functionality is just a sequence of steps, consisting of parties 
performing local computations, and sending intermediate messages to each other. In this work we 
build up a protocol for computing (4) which is secure in the presence of "semi-honest" parties. 
That is, parties who obey the protocol (and do not try to e.g., inject malformed data) but keep a 
transcript of all the messages they receive. Intuitively, a protocol is secure in this setting whenever 
the intermediate messages give no information about the secret inputs of other parties. Formally, 
the "view" of the j^^ party during the protocol is: 

Yiewj{{Xi,yi),{X2,y2),---{Xp,yp)) = {(X^, y^), r, mi, • • • m|^|} (5) 

where r is a record of all the random draws made by party j, and is the k^^ message received 
by that party (we have dropped dependence of m on j for readability). 
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The protocol is secure so long as there exists a polynomial time algorithm which, when given 
only the input and output of party j, may output a random transcript of message which is com- 
putationally indistinguishable from viewj. See Goldreich [9] for a definition and discussion of com- 
putational indistinguishability. In essence, if the distribution of the sequence of messages depends 
only on the private input and output of party j then we can simulate messages by drawing from 
this distribution (so long as the random number generator returns samples which are computation- 
ally indistinguishable from draws from the distribution). The existence of a simulator shows that 
intermediate messages do not depend on the private input of other parties, and so the protocol is 
secure in the sense that parties gain no more information about each other's private inputs than 
that revealed by the output of the protocol. Note that this type of security is "orthogonal" to that 
studied in Chaudhuri and Monteleoni [4], which seeks to prevent leakage of secret information in 
the parameter vector. 

An example of a protocol which does not achieve this definition of security is one where all 
parties send their data to party 1, who computes the parameter locally on the combined data and 
then sends it back to all other parties. In this case the messages received by party 1 consist of the 
data of other parties, in general it is impossible to simulate these messages given only the input 
and output belonging to party 1. 

In the next section we present a protocol for performing Newton's method on the logistic 
regression objective in a way which is secure in the presence of semi-honest parties. Our protocol 
makes use of a specially designed approximation for the logistic function. Section 5 then describes a 
different approximation necessitating the operations of only sums and products, and thus speeding- 
up the computations. 

Although we propose to use the cryptographic model for security, others exist and deserve a 
place in the theory of privacy preserving data analysis. The main alternatives we see are "weak" 
security, and perturbation of the data. The former comprises a body of literature summarized 
in Vaida et al. [20]. The idea is that by giving weaker privacy guarantees, we can implement 
much more efficient protocols. Whether it is acceptable to have this weaker privacy guarantee is 
a question which one must consider on a case- by-case basis. Although we describe our protocol in 
terms of the cryptographic model, by replacing the primitive operations (in Section 3) with their 
weakly-secure counterparts, we convert our protocol into a weakly secure (but also computationally 
more efficient) one. 

The second alternative is data perturbation or sanitization. The idea would be for each party 
to somehow perturb his data until he is happy to release it to the other parties (e.g., through the 
addition of random noise). Thereupon the parties would each have a noisy copy of all the data, 
and could locally compute whatever statistical method they wanted on the union of the data. The 
difficulty with this approach is that to protect privacy may require the addition of noise of such 
amplitude as to render the data itself useless. 

3 Primitives for Secure Protocols 

In this section we lay out some primitives and sub-protocols which we will commbine to make a 
full logistic regression protocol. While, details of the implementation of these primitives are in the 
references cited, we also include some in the appendix. 
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3.1 Secret Sharing 



In our construction we make extensive use of additive secret sharing. The idea is to divide a 
quantity of interest a into P random numbers aj (one for each party) so that aj = a. If the aj 
are distributed uniformly in the field (e.g., the entirety of M) then any subset of the aj will reveal 
nothing about a. In fact the sum over any subset is a random variable, the distribution of which 
does not depend on the secret value. 

We use this construction to keep all intermediate quantities secret during the evaluation of 
Newton's method (i.e., the gradient, Hessian and intermediate parameter vectors). As long as we 
can construct sub-protocols which compute random shares of a quantity, from random shares of 
inputs, then we can compose these sub- protocols together to finally obtain random shares of the 
logistic regression estimate. With these in hand the parties can then exchange shares and reveal 
the vector itself. 

Although the joint distribution of the aj concentrates on the linear subspace corresponding to 
the secret value, marginally the shares are uniformly distributed and do not depend on any param- 
eters. Hence we can easily simulate messages based on these shares since the marginal distributions 
are known, and we achieve security as defined in Section 2 . Next we show how to compute additive 
shares of all the intermediate quantities using the abstract definition of additive shares. Although 
this approach is intuitively appealing, computers would quickly run into problems representing 
samples drawn uniformly from M. Therefore, in the appendix we show how to approximate arbi- 
trarily well the same computations in modular arithmetic on Zb = {0,1, ■ ■ ■ B — 1} for some large 
B. 



3.2 Computing Sums and Products with Random Shares 

To implement Newton's method we must essentially perform linear algebraic operations on random 
shares, for example by computing shares of the Newton step from shares of the gradient and inverse 
Hessian. In this section we describe how to obtain random shares of sums and products of quantities 
that are themselves represented as random shares. Using these constructions, we compute inner 
and outer products of vectors of random shares, and hence also matrix multiplies. 

Computing shares of the sum of two secret quantities a = aj and b = bj is direct, as it 
involves only the local computation aj + bj for each party j = 1, . . . ,P. That is, party j simply 
adds his shares aj and bj together to get a random share of the quantity a + b. Obtaining random 
shares of the product of two secret quantities is more involved: 

ab = ^aj^bk = Y, <^ih + "i^^ 

j k j j kjtj 

The elements of the first sum on the right hand side can be computed locally by each party. The 
second (double) sum, however, requires products between random shares held by different parties. 
To obtain these terms while maintaining the security of the protocol, we turn to oblivious function 
evaluation. That is, we pose the problem of computing the product as evaluating a function so that 
one party only knows the function and the other party only knows his input and the value of the 
function applied to that input. 

The function set up by party j and evaluated by party k on his input, bk, is: 

f(j-,k)ix ; aj) = ajX + rj^k , (6) 
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where ^ is a quantity generated uniformly at random by party j. Evaluation is done in a manner 
so that party j learns nothing of the output (and thus only learns about rj^k which he generated) 
and party k learns only the output. Since party k does not know the value of the random variable 
rj^k he has learned potentially nothing about the true value of the product. Taking nij^k = —'''j,k 
and rij^k = f(j^k){bk '■, dj) we have that rrij^k + '<^j,k = Ojfefc, and thus they form random shares of the 
product ajhk- 

Once parties compute random shares for all the terms ajh^, they can locally compute random 
shares of c = a6 as: 

Cj = ajbj + ^(nfcj + rrij^k) = ajbj + ^(/(fc^j)(aj ; bj,) - rj^j.) . (7) 

k^j k^j 

Summing up these quantities, and utilizing the definition of the linear function set up in (6), we 
easily obtain: 

j j kT^j j k 

which shows that the Cj's in (7) are indeed (additive) shares of the product. 

This protocol generates random shares of the product even if the original shares weren't them- 
selves random, e.g., if they were due to the partitioning of the data. A method which implements 
these encrypted multiplications using fixed-point arithmetic is given in [6]. 

We also note that dividing one secret value into another securely is much more difficult than 
dealing with products and requires more elaborate (and computationally demanding) protocols. 
Below we show how matrix inversion can be performed without any divisions. 

3.3 Evaluating Interval Membership 

We suppose we are able to evaluate the following predicate in a secure way: 

l{a > b} 

Where a, b are secret values held by separate parties. This is known as Yao's "millionaires problem," 
since he described it in the context of determining which millionaire has the most money, without 
disclosing actual bank balances. 

An example of a protocol which computes this predicate is given by Blake and Kolesnikov [2] . 
We can also trivially extend it so that each party receives a random share of the output bit (i.e., 
each party receives a random bit, the "xor" of which yields the correct output bit). Using this 
technique we can also check whether a secret value (i.e., a sum of random shares) is greater or less 
than some constant: 

l{ai + a2 > c} = l{ai > c — 02}, (8) 
where oi, 02 are the random shares of a held by two parties. 

3.4 Securely Inverting a Matrix 

We use a matrix inversion routine built up entirely of matrix multiplications and subtractions, 
thus allowing us to use the constructions of the preceding sections to implement it securely. We 
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obtain the reciprocal of a number a without necessitating any actual division by an application 
of Newton's method to the function f{x) = — a. Iterations follow Xs+i = Xs{2 — axs), which 
requires multiplication and subtraction only. 

It turns out that we can apply the same scheme to matrix inversion, e.g., see [11] and references 
therein. A numerically stable, coupled iteration for computing A~^, takes the form: 

Xs+i = 2Xs — XgMs , Xq = c~^I , , . 

Ms+i = 2M, -M2 , Mo = c-iA, 

where Mg = XgA^ and c is to be chosen by the user. A possible choice, leading to a quadratic 
convergence of Xg — )• A~^ {Ms — t- /), is c = maxjAi(^). In our actual implementation we used 
instead the trace (which dominates the largest eigenvalue, as the matrix in question is positive 
definite), since we can compute shares of the trace from shares of the matrix locally by each party. 
To compute we use the same iteration, with scalars instead of matrices. For this iteration we 
initialize with an arbitrarily small e > (as convergence depends on the magnitude of the initial 
value being lower than that of the inverse we compute). We use the constructions of section 3.2 
to iterate through (9) until convergence. As Mg — )• /, we check for convergence by considering the 
absolute difference between the trace of Mg and the data dimension d, and we can evaluate the 
function l{|tr(Ms) — d\ > e} on random shares of the trace of Mg using the same form as (8). 

4 First Protocol for Logistic Regression 

We recall the usual Newton- Raphson iteration expression (3). To perform the iteration we first 
compute random shares of the update direction: Aj = — (V'^^(/3t))^-'^V£(/3j), via the formulation 
of matrix-vector products of random shares. We can then add these random shares to the current 
parameters Pt to obtain random shares of fit+i- To check convergence recall (from e.g., [3]) we 
should end if: 

A2 = (V£(A))^At < e . (10) 

We can compute (10) securely using the same form as (8). The result is sharable among all the 
parties, and the protocol ends whenever the result is 0, i.e., when is not greater than e. 

By using the constructions of the previous section, we have the tools required to invert shares 
of the Hessian, and thus to compute the Newton step. All that we need to do is construct a secure 
protocol to evaluate the logistic (sigmoid) function. In principle, a specialized sub-protocol could 
be built up using the construction of Yao [21]. The method would be to construct circuit that 
evaluates the sigmoid function in the same manner that the arithmetic logic unit in a CPU would. 
Then we could give this circuit the secure treatment and make it into a protocol following Goldreich 
[9]. The disadvantage with this approach is that the circuit evaluation protocols are prohibitively 
expensive and thus they are not useful in practice except for trivial circuits, see e.g., Malkhi et al. 
[17]. Instead we use a specially crafted approximation to the logistic function in terms of indicator 
functions. We describe this next. 

4.1 A Secure Approximation to the Logistic Function 

The logistic function itself is the CDF of the logistic distribution. We propose to approximate this 
function with an "empirical CDF." This is a function of a set of L samples z;, taken independently 
from a logistic distribution: 
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L 

a{a)^FL{a) = L-^Y.^{a>zi]. (11) 

1=1 

Based on the Glivenko-Cantelli theorem, and later work by Dvoretzky, Kiefer and Wolfowitz, 
the rate at which the empirical CDF converges to the true CDF (i.e., the logistic function which is 
of interest) is known. Using these results, we obtain bounds on the maximum difference between 
the logistic function and our approximation, which hold with high probability. See the remark 
below in Section 4.2 about the accuracy of the approximation. 

We now turn attention to obtaining random shares of the logistic function evaluated at random 
shares of P'^Xi. We obtain random shares of by using the inner product construction for 

multiplying together random shares. If we denote shares of this inner product by {0^Xi)j for party 
J, we write: 

aiP'^Xi) « L-^ ^ = l{(/5^a;i)i + (/3^Xi)2 > zi] . (12) 

Thus the problem reduces to getting random shares of the sum of indicators. Note that we can 
re-write each indicator function as: 

l{(/3^x,)i + (/3^Xi)2 > zi} = l{(/3^Xi)i >zi- {P^Xi)2} . (13) 

If party 2 generates the logistic random variables then we have a trivial reduction to (8). In order 
to restrict the view of either party to a random share, we restrict the output to random bits o\, 
and 02, such that 

1 if l{a > zi} 
otherwise ' 



o[ ® O2 



where © is the exclusive or. The right-hand side of equation (12) requires (random shares of) the 
fraction of outputs with o^^ © 02 = 1. This can be established by noticing that 

L L L 

5^(4 ® 4) = 5^ o\ + Y,o\- 2of 02 , 

1=1 1=1 1=1 

where we denote = (o^, . . . ,0^) for k = 1,2. Jagannathan and Wright [12] use this method to 
convert xor shares into additive shares for a different privacy-preserving task. 

In order for the output to behave this way, we can either use Yao's protocol directly, or take a 
(more efficient) GT protocol and modify it to give a (xor) random share. In this work we use the 
protocol of Blake and Kolesnikov [2]. Having computed random shares of the logistic function, we 
can use the constructions of Section 3.2 to compute random shares of the gradient and Hessian, 
and hence build a full logistic regression protocol. 

4.2 Quality of the Logistic Approximation 

A comment about the accuracy of approximation (11), and the resulting logistic parameter es- 
timator is in order. The tail behavior of the sup- norm of the error is given, for every e > 0, 
by: 

L 

P{\\a{-) - ^ ^'}IL > ^) < 2e-2^^' , (14) 

1=1 
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known as the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. One possible way to choose the 
number of Logistic variables L in practice, is by ensuring that the above probability is no more 
than a prescribed level of accuracy, say a. Solving for L we obtain the (very conservative) bound 
L > —56"^ log(a/2). A less conservative bound might entail the maximum absolute error restricted 
to some interval (containing the origin). 

We relate error in the approximation of the sigmoid (and hence the gradient) to the error in 
the convergent parameter by the following inequality: 

P-/?lb< ^'^" + "'."-^^""-' . (15, 

in which /3 is the optimizer of the exact log likelihood, and /3 is the optimizer of our approximation, 
Amin is the smallest eigenvalue of the Fisher information matrix /(/3) = — n~^V^^(/3) (on some 
interval, see Appendix A), and R is the radius of a ball which containing all the data vectors (i.e., 
Vi, ||2;j||2 < R). The proof of this inequality follows lemma 1 of [4] in which the two convex func- 
tions are the exact log likelihood objective, and the difference between the exact and approximate 
objectives. See Appendix A for detailed theoretical derivation. 

Since we can use expression (14) to bound the numerator of expression (15), the parameter 
output by our protocol, and that output by the exact (non-private) algorithm can be brought 
arbitrarily close (except on a set of negligible probability) by increasing the parameter L. Later, we 
perform an experiment to show how well the method performs with reasonably small L. Note that 
for Newton's method to converge in this approximation, we must use the same sample of L logistic 
random variables each time we approximate the sigmoid. Otherwise assessing convergence would 
be difficult as the objective function would be constantly shifting. We propose that the parties 
draw L logistic variables ahead of time, and use these for all the computations. 

4.3 Hessian Lower Bound Technique 

Notice that the Newton Raphson method requires inverting a matrix (the Hessian of the log like- 
lihood) at each iteration. In our setting, using our iterative inversion method this becomes very 
expensive. Therefore we propose to use a well-studied approximation [18], which replaces the 
iteration by: 

f3t+i = Pt- A{X^X)-^Vi . (16) 

First note that under this technique the algorithm only ever needs a single matrix inversion, since 
X is constant throughout all the iterations. Second, this algorithm still eventually converges 
to the correct parameter value (modulo the other approximations we make in our protocol). The 
reason is that the inverse hessian is always greater than A{X'^ X)~^ , in the sense that the difference 
is positive semi-definite, see e.g, Minka [18] for more details. What's more, this technique ensures 
that progress towards the optimum is monotonic, and so assessing convergence may be simpler. 

4.4 Computation and Communication Complexity 

First we count how many times we must run each of our primitives for each iteration of Newton's 
method. The approximation of section 4.1 requires nL instances of the GT protocol per round, as 
L instances are required per case. Computing the gradient and the Hessian requires n(l + d + d?) 
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multiplications. Inverting the Hessian takes 2d^ multiplies and one GT per iteration of (9). Since 
this inner iteration is quadratically convergent, it takes O(logd) iterations to converge, and thus 
takes 0(d^log(i) multiplies and O(logci) instances of GT. In total then, each outer iteration takes 
0{ncf + logd) multiplies, and nL + 0{logd) invocations of the GT protocol. 

Each multiplication requires a number of encryptions and decryptions; this scales quadratically 
with the number of parties P since they must exchange with one another. Thus the computational 
workload increases as the data are split into more pieces. Note that although repeated use of the 
cryptosystem is quite expensive, performance on normal hardware is relatively rapid. A machine 
dedicated to the computation and running multiple threads can do thousands of encryptions per 
second. 

Each instance of GT using the protocol of [2] requires 0(logi?) encryptions and decryptions 
(and operations on encrypted values etc.). Therefore in total our approximation of section 4.1 
requires 0{nL\ogB) encryptions per iteration. This may be too computationally demanding for 
large L. One way to reduce this cost is to run the scheme using a coarse approximation to the 
sigmoid (i.e., a small L) to convergence, then increase L, resample the logistic variables and then 
continue Newton's method from the previous convergent parameter. Although the latter iterations 
will still be computationally burdensome, there will be fewer of them. Another way is to use a 
different approximation to the sigmoid function. This is outlined next in Section 5. 

Note that the total amount of communication by all parties is also proportional to the number 
of multiples and GT invocations. For an invocation of either, a party must transmit logA^ bits to 
another party, and then receive a message of the same length. There are a total of 0{P'^{nd^ + 
d^ log d) + nL) messages which must be sent for each iteration. If the number of parties or cases, 
or the granularity of the approximation is large, running the protocol over a high speed local area 
network would make the communication overhead manageable. 

5 Second Protocol for Logistic Regression 

As we mentioned above, the computation complexity of evaluating approximation (11) to the 
logistic function scales linearly with L, since on each of Newton's iteration we invoke Yao's protocol 
to compute the GT predicate, and we do it for every case i. This may be prohibitively expensive 
even for a moderate L. A possible way to reduce this computational burden was briefly described 
in Section 4.4. Here, we provide full details of a more structured approach, which is reminiscent 
of Euler's method. The approach is built (again) on computing Newton's iteration (3). It would 
be more natural in this section to treat the logistic function in a vectorized fashion, i.e., o"(a) = 
((T(ai), . . . , a{an))-, for an n-dimensional vector a = (ai, . . . , a„). Therefore, we use different, albeit 
equivalent, representations for the gradient and Hessian: 

V£ = X^{y - a{Xp)] , V^l = -X^diag{CT(X/3) o (1 - a{X^))]X . (17) 

Here X is the design matrix whose rows are xj , the units or feature vectors (see (2)). The symbol 
"o" denotes the element- wise product, i.e., uov = vou = d\ag[u)v. 

We modify the iteration so that we neither explicitly compute the logistic function a{-) which 
is involved in both the gradient and Hessian, nor use the approximation in expression (11). Note 
that throughout the procedure we may treat each unit Xi as having an associated logistic function 
value cr{fijxi). We propose to track a vector of approximate function values at ~ (j{Xf3t) which 
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will be updated after each iteration. Then, these approximate values will be used to compute the 
next iteration of Pt. Note that the derivative of the logistic function is given by: 



a'{a) = a{a){l-a{a))''^'g{a{a)). (18) 

Therefore, knowing the value ex (a), we can determine the derivative of the logistic function around 
a by a single multiplication. Linearizing around some value oq gives: 

a{a) = (T(ao) + (a - aQ)g{a{aQ)) + 2~^{a - aofa"{-)\^, « cr(ao) + (a - 00)5(^^(00)) , (19) 

where the second derivative is evaluated at some value a* in the interval between a and ag. Denote 
by Af = Pt+i — Pt as in Section 4, then may make use of the approximation: 

^t+i = ^t + {X/\t) o g{at) , (20) 

where g is applied element-wise to at- 

Over the course of the entire algorithm, the approximation at is updated repeatedly, in a manner 
very similar to using Euler's method to numerically integrate the differential equation (18). It is well 
known that the error of this method decreases with the size of the "step" taken at each iteration. 
In the above, the steps are of size XAt, which will in general be different on each iteration, and 
will also be different for each unit. In order to control the error we amend this approximation by 
breaking down the step into k smaller steps each of size k~^XAt, and performing k such updates. 
As we shall see, we may base our choice of k on some aspect of the design matrix, X, in order to 
reach a desired level of error in the approximation. We write this approximation as: 

k 

at+i = at + k-'XAt o 9{^t) = + XAt o 5fc(<Ti, XA*) , (21) 

i=l 

where the a* are the intermediate values corresponding to the inner iterations, and we define gt as 
the function which gives the average value of g evaluated on these values. 
We summarize our method in the following coupled iteration: 

ao = 2-1 -1"^! 

At = 4(X^X)-iX^(y - at) (22) 

Pt+i = Pt + At 

Gt+i = at + XAt ° gk{^t,XAt) , 

where 0'^^^ is the d-dimensional vector of zeros and l"-^i is the n-dimensional vector of ones. The 
proposed iteration differs from the protocol of Section 4 (and from the usual Newton-Raphson 
method). The main difference is that we have replaced the logistic function approximation (11) 
with our Taylor approximation. Note that we are using again the bound on the Hessian (see Section 
4.3), which would make computation easier. We use this technique in our method for this reason, 
and also since it interacts well with our Taylor approximation by ensuring that convergence towards 
the optimum is in a sense monotonic, as shown in Section B.l. In keeping with our goal of using 
only sums and products, we recall that it is possible to invert a matrix with just these operations 
(see Section 3.4). 
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We now present a bound on the distance from our approximated regression coefficients /3t, to 
the true optimizer of the log-hkeUhood which we denote by /3, as in (15). Since our iterations are 
guaranteed to converge (see Section B.l), we can choose to run the iterations until ||X-^(y — o"t)||2 
is smaller than some threshold b (i.e., by choosing t accordingly): 

b>\\X^{y-dt)\\2 > \\X^iy-at)\\2-\\X^icTt-at)\\2 ■ 

Therefore we can bound the norm of the gradient of the logistic log-likelihood taken at our final 
parameter estimate: 

\\Viil3t)\\l<b + nRcT . 

where R is the radius of a ball containing all the data vectors, exactly as in (15), c is some constant, 
and T is a quantity upper bounding the maximal Euler's step size. 

We can use this to construct our main result about the quality of our approximation. Suppose 
we choose b < nRcr, then from the above we have: 

m-P\\2<^, (23) 

^min 

where Amin is the smallest eigenvalue of the Fisher information matrix /(•) = — n^^V^£(-) in the 
line segment between (3 and /3. Note that Amin = "'^'^Amm and the factors of n cancel. 

Therefore we can make the accuracy of our approximation arbitrarily good by decreasing r 
although, as we shall see there is a tradeoff involved. A smaller r usually means a higher k, 
resulting in increased computational demands. We refer the reader to Appendix B for complete 
technical details. 

5.1 Choice of k 

Thus far we have that the error of the approximation decreases as r is decreased; however, this last 
variable is not controlled directly (as L was in protocol 1) but rather is a function of k, the number 
of steps taken for each outer iteration of the algorithm. 

In principle, to get at a prescribed step size r, we can choose k by noting that: 

T < \\k-^XAt\\oo < k-^\\XiX'^X)-^Xiy - at)\\2 < k'^^ (24) 

This leads to the overly conservative choice oi k = T~^^fn. An alternative choice is to run the 
protocol with a small value of /c, e.g., 10, and then to re-run with different values to assess the 
sensitivity of the computation. In Section 7 we show that this technique performs wekk even with 
small k. 

5.2 Computational Complexity 

We can measure the overall complexity of our method in terms of the number of products that are 
needed, since these are the most time-consuming operations we use. First note that to construct 
the matrix X'^ X takes nd^ products, and inversion of this matrix takes 0((i^) using (9), where the 
constant is related to the condition of the matrix. Then on each iteration, to compute takes 
nd -f products. Our approximation to the logistic function takes nk products, for a total of 
n{k -|- d) -|- products per iteration. 
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We compare this with the cost of a protocol which computes the logistic function via a specially 
designed sub protocol based on circuit evaluation, cf. Yao [21]. If the latter may be evalutated using 
q encryptions, then the complexity would be n{d + q) + cP operations per iteration. As mentioned 
before this number would typically be much larger than k (for example on the order of the number 
of bits used to represent the numbers). Therefore on each iteration we can save a multiple of n 
operations, which may be especially important when n is large. 

6 Security Guarantees 

Since our protocol runs until convergence, the number of rounds is variable and depends on the data 
itself. Furthermore a matrix inversion was performed by an iterative scheme which itself took some 
variable number of iterations to converge. Therefore we amend the protocol so that the output for 
each party is a triple consisting of the convergent parameter value, and the number of iterations 
it took to converge, and the number of iterations taken for the matrix inversion. This way the 
messages received from testing convergence are easily simulated (i.e. a zero on every round up until 
the number specified in the output, then a one on that iteration) and this clearly reveals no more 
information since the parties know "where they are" in the protocol at all times and could count 
these numbers of iterations. Having dealt with this technicality we will consider simulating the 
other intermediate messages in our simulator, and consider these convergence tests already taken 
care of. 

In both of our protocols, the messages which are transmitted are always part of some sub- 
protocol, namely multiplication or evaluation of the "greater than" predicate. The only exception 
to this is the final messages which are sent immediately before the output is reconstructed. As 
those messages are themselves random shares they may be simulated easily (although they must 
be simulated in such a way that they sum to the correct output values, but this is trivial). The 
messages which are passed during the sub-protocols may be simulated based on their respective 
input and outputs so long as the sub- protocols are cryptographically secure. Since we take care 
to ensure that the intermediate values are random shares, the simulators for the sub- protocols 
"compose" to form a simulator for the main protocol (see [9]). 

7 Illustrative Experiments 

We provide two illustrative experiments to demonstrate our approach. The first aims at showing 
the performance of our protocol from Section 4. Specifically, we examine the effect of approximation 
(11) on the resulting parameter values, when small, and large number of logistic variables L are 
being used. The second example takes a look at the altered protocol from Section 5, which uses 
the coupled iteration (22) instead of approximation (11), and reports its performances for different 
values of k, the number of Euler's "steps". 

For both experiments we use an extract from the Current Population Survey (CPS) data (see 
http://www.bls.gov/cps/), which includes data on a sample of slightly more than 50,000 U.S. 
households. We focus on predicting whether household income is greater than 50,000 dollars. We 
converted M-category features into M — 1 binary features, and divided age into 4 bins correspond- 
ing to 20 year intervals. Note that although we expressed our approach in terms of continuous 
covariates, it handles binary flags just as well, where said covariates take on e.g., 0.0 and 1.0. 
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Our protocol from Section 4 deviates from the exact computation in two ways, first we use an 
approximation to the gradient, and second we perform all the calculations in fixed-point arithmetic. 
Both of these approximations can be made arbitrarily tight but at the expense of computational 
efficiency. To demonstrate that our protocol can be implemented in an efficient manner and produce 
reasonably accurate results we implemented it in a simulator and compared the results to exact 
logistic regression on the CPS data. 

For each of L = 100 and L = 500 we ran our first protocol 100 times. The table below shows the 
means and standard deviations of the resulting parameter values. Evidently, as L gets bigger, the 
accuracy of the parameter values improves. Figures 1 and 2 show how the likelihood of the estimate 
maintained by the protocol increases with the number of iterations. We computed the error bars 
by removing the 5 samples that deviated from the mean by the greatest amount, and plotting 
the minimum and maximum from the remaining ones. This then corresponds to an approximate 
95% confidence interval, and would become an exact interval if we were to perform more and 
more simulations. For the purposes of comparison, we also plotted the likelihood achieved by the 
exact non-private Newton Raphson algorithm, and a non-private algorithm which we referred to 
as "hessian lower bound." Both give upper bounds for what we hope to achieve, the latter is an 
algorithm where we just use the approximation of (16), and exact (i.e., non-private) logistic sigmoid 
values. We see that as L increases, the first protocol more closely approximates the hessian lower 
bound technique, which converges more slowly than the exact Newton Raphson method. 

For the second experiment, we ran our coupled iteration on the CPS data with k = 5, 10. 
Although each iteration of our algorithm may be cheap, all is for nought if we require many more 
iterations for convergence. To determine whether this happens we compared our method to the 
Hessian lower bound method of (16), since this represents our algorithm without the approximation. 
In figure 3, we plot the likelihoods of the second protocol against the iteration number. Since there 
is no randomness in the second approximation, there are no error bars. Even for small values 
of k, much smaller than those suggested by (24), the approximation to the Hessian lower bound 
technique is quite good, and increasing k further (e.g., to 50) results in curves which are exactly 
the same as that of the Hessian lower bound method. In table 1 we show the resulting parameter 
estimates for both methods. 

8 Beyond Logistic Regression 

We can use the construction of Section 4 to build secure protocols for similar statistical calculations, 
e.g., the constructions for computing shares of outer products and matrix inverses naturally yield a 
secure algorithm for performing linear regression, for details see [6]. Furthermore using the "ridge 
regression" penalty on the weights (i.e., computing a MAP estimate under a Gaussian prior) can 
naturally be added to the protocol for both linear and logistic regression. It is also possible to 
implement the coordinate ascent computation of the lasso (or sparse logistic regression) using these 
constructions (i.e., using the GT protocol to perform soft thresholding). 

Our protocol generalizes to the class of Generalized Linear Models (GLMs) including logistic 
regression with other link functions. GLMs consist of a random component Yi from an exponential 
family, a systematic component with a linear predictor rji = xj(3, and a link function r/j = h{fii), 
where ^Uj = El^. If h makes the linear predictor rji = Oi, where 6i is the natural parameter of the 
exponential family, h is canonical. 

For Poisson log- linear models with the canonical link, Hi = exp{?7j}, we approximate the ex- 
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NR 


PI, L = 100 


s.d. 


PI, L = 500 


s.d. 


P2, fc = 5 


P2, A; = 10 


Intercept 


-10.7536 


-11.6306 


1.0761 


-11.5732 


0.5339 


-10.7944 


-11.2262 


Child Sup 


0.0002 


0.0002 





0.0002 





0.0002 


0.0002 


Property Tax 


0.0003 


0.0003 





0.0003 





0.0003 


0.0003 


Num in Household 


0.9802 


0.9916 


0.0863 


0.9881 


0.0434 


0.9259 


0.9601 


Num Children 


-1.056 


-1.0721 


0.0935 


-1.0685 


0.047 


-1.0017 


-1.0384 


Num Married 


0.0342 


0.0343 


0.0053 


0.0343 


0.0022 


0.032 


0.0333 


Child Sup Ind. 


-0.0001 


-0.0001 





-0.0001 





-0.0001 


-0.0001 


Education 


0.3218 


0.3276 


0.0295 


0.3265 


0.0146 


0.3058 


0.3172 




-4.6178 


-3.9701 


0.3451 


-3.94 


0.1638 


-3.6202 


-3.8011 


Age 


-4.2368 


-3.6782 


0.3148 


-3.6596 


0.1504 


-3.4817 


-3.5823 


-3.9608 


-3.3355 


0.2852 


-3.3157 


0.135 


-3.1592 


-3.2481 




-4.9575 


-4.4069 


0.3795 


-4.3814 


0.1829 


-4.1096 


-4.2624 




0.0064 


0.0051 


0.0282 


0.0001 


0.012 


-0.0035 


-0.0007 




-0.5362 


-0.5623 


0.058 


-0.5562 


0.0279 


-0.5205 


-0.5404 


Marital Status 


-0.4378 


-0.4718 


0.0819 


-0.4609 


0.0374 


-0.3934 


-0.4321 


-0.5855 


-0.6096 


0.0675 


-0.6018 


0.0309 


-0.572 


-0.5889 




-0.9617 


-1.0283 


0.1146 


-1.0116 


0.0549 


-0.957 


-0.9874 




-0.7496 


-0.7888 


0.0852 


-0.7783 


0.0401 


-0.736 


-0.7598 




0.2417 


-0.2588 


0.0276 


-0.2565 


0.0135 


-0.2401 


-0.2491 


Race 


-0.4981 


-0.5167 


0.05 


-0.5128 


0.0244 


-0.4835 


-0.4997 




-0.1658 


-0.1796 


0.0196 


-0.1785 


0.0102 


-0.1631 


-0.1719 


Sex 


-0.27(33 


-0.2802 


0.0211 


-0.2792 


0.0125 


-0.2(313 


-0.2712 



Table 1: Estimates produced by the exact method (Newton Raphson), and the two protocols, for 
different parameter settings of the protocols. 
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ponential function similarly. For Gamma models with the canonical link, Hi = 1/rji, and for 
inverse-Gaussian models with the canonical link, 1/^^, we can use the number inverting without 
division scheme. Our approach can also be extended to treat binary regression with non-canonical 
links, such as the probit link function, or more generally, inverse CDF link functions. The general 
form of the gradient is: 

- 2^ Var(y,) % ■ 

Let F denote a given CDF (F = ^ leads to the probit link function, while, of course, F = Fl leads 
to the logit link function). Then, = F{r]i), and thus d^i/drji = f{rji), where / is the density. 
Therefore, we should find approximations for / as well as for F (approximation for F will follow 
the same idea as for Fl, i.e., using the empirical CDF). 



9 Conclusion 

We have demonstrated that a fully secure approach to logistic regression based on the cryptographic 
notion of security may be made practical for use on moderately large datasets shared between several 
parties. Although it is slower than methods with weaker security guarantees, it offers more rigorous 
guarantees with respect to the privacy of the input data. We emphasize that our protocol (like 
any cryptographic protocol) prevents leakage of information which may arise from the computation 
itself. It does not address any leakage which results from the output. 

The problem of secure regression is far from solved however, we have yet to deal with the 
problem of record linkage, and have implicity assumed that the parties know how their respective 
datasets are aligned. Furthermore record linkage due to a statistical model may be incorrect and 
may result in errorful estimates of model parameters. 



A Theoretical Validity of the First Protocol 

Here we show how a bound on the error in the approximation (11) to the logistic function leads 
to a bound on the quality of the convergent parameter vector output by the protocol. Specifically, 
we establish the validity of (15). Let R denote a constant such that ||xi||2 < R, for i = 1, . . . ,n. 
Recall the expressions for the gradient V£ and Hessian given in (2). Define the approximated 
gradient, by substituting Fl for a: 

n 

= Yl ""^y- - ^Mxfp) . (26) 
j=i 

Rewriting = V^(/3) + J2^=i ^iPhixJ fi) — Xia(xff3), and applying the triangle inequality we 

obtain a bound on the norm of the gradient of the logistic objective: 

\\Veim2<\Nm\\2 + nR\\FL{-)-a{-)\\o.. (27) 

Next we convert a bound in the norm of the gradient into a bound on the distance to the optimum. 

Lemma A.l. Let f3 be the optimizer of the logistic regression objective, and let Xmin denote the 
smallest eigenvalue of the negative Hessian in the line segment between (3 and p. Then: 

||,_^||,<». (28) 
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Proof. We use the mean- value theorem (for vector- valued functions) to write the difference between 
gradient vectors at /? and /3: 

V£(/3) - Ve0) = Ve{P) - = V^e{aP + (l - a)/3) da^ (/3 - /3) . (29) 

Now, for every (symmetric) matrix B, and a non-zero vector e, the Rayleigh quotient satisfies 
e^Be/e^e > Amin(-B), where Amin(-B) is the minimal eigenvalue of -B. If = yl^, for a posi- 
tive definite (symmetric) matrix A, this reduces (after taking the square root on both sides) to 
||^e||2/||e||2 > Amin(^)- Applying this to (29), and using Weyl's inequality, we have: 



l|V^(/3)||2= ^ v2(a/3 + (l-a)/3) daj (/3-/3) ^> A^in||/3-/3||2. (30) 

This completes the proof. □ 
Lemma A. 2. Using the same notation we have: 

min 1 1 V^(/3) 1 12 < ni?L-^ (31) 

where 55 is a (non-empty) set of logistic parameters defined in the proof. 

Proof. Consider a continuous, monotonically non-decreasing function g(-) which satisfies \\g{-) — 
Fl{-)\\oo < L~^. Such a function clearly exists, for example the smooth nondecreasing curve which 
goes through all points {zy^,jL~^) where 1 < j < ^ (where is j^^ smallest logistic variable 
used in Fl). Since g{-) is nondecreasing, it is the derivative of some convex function: 



G{a) = / g{b) db (32) 

J —00 

Consider the approximation to the logistic gradient which uses g instead of F^: 

n 

Vm = x^yi - Xig{xf(3) (33) 



i=l 

This is the derivative of a concave function: 

n 

m = Y.'^JPyi-G{xJp) , (34) 

i=l 

which is indeed concave since it is a linear function minus a convex function. Hence i has a unique 
maximum somewhere. Consider the functions g{-) so that the maximum is in the interior of the 
space (i.e., is not at infinity). Hence for each such g we have a point /3 S M*^ where the gradient 
is zero, i.e., V£(/3) = 0. Denote the set of such (3 by 55, and note that *B is not empty. An argument 
similar to the one that led to (27) shows that: 

n 

\\mm\2 = ||V£(/3) + ^x,g(:Ef/3) -x,Fi(xf/3)||2 < WVmh + nRL-^ . (35) 

i=l 

Therefore: 

||V^~(^)||2 < \\Vl{M2 + nRL-^ = nRL-^ , (36) 
which completes the proof. □ 
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We now put this all together and state the main result about our approximation F^. 

Lemma A. 3. // our approximation Vi is used as an approximation to the gradient of the logistic 
log likelihood, and numerical optimization is performed until \ \Vi{(3)\\2 < nRL~^ , then: 

\w-m,<?i£i±mi^<yM, (37) 

A mm 

where /3 is the optimizer of the exact logistic regression objective, (3 is the result of our numerical 
optimization, R is the radius of a hall containing all the Xi, and Xmin is the smallest eigenvalue of 
the Fisher information matrix /(•) = —n~'^'V'^i{-) in the line segment between /? and /3. 

Proof. Notice that ||V^(/3)||2 < nRL^^ is guaranteed in light of Lemma A. 2. The proof follows by 
substituting (27) into (28), and by noticing that Amin = "-"^Amm and the factors of n cancel. □ 



B Theoretical Validity of the Coupled Iteration 

Here we establish the convergence of the coupled iteration (22), and the error in our Taylor ap- 
proximation of the logistic function. 



B.l Monotonicity and Convergence 

We show that the update described in (22) converges monotonically towards some final value /3. 
We relate the size of the step taken at one iteration to the size of the step in the previous iteration. 
We aim to show that first, these steps are always in the same directions for each unit, and secondly, 
the steps are monotonically decreasing and eventually the iterations converge. 

Lemma B.l. AA^+i element-wise has the same sign as XAt, in the sense that AA^+io AAj > 0. 

Proof. If we define the idempotent matrix M = X{X'^ X)^^X'^ , then we write: 

AA^+i = 4X{X^X)-'X{y-at) 

= 4M{y-at) 

= 4M[y-at-i-{XAt)ogk{at-i)] 

= 4MM{y - at-i) - IGMdiag (5fc(at„i))M(y - at-i) 

= 4Mdiag(l-45fc((Tt_i))M(y-CTt_i) 

= Mdmg{l-A~gk{at-i))XAt , (38) 

where we made use of the idempotency of M. Next considering the element-wise product as the 
diagonal of the outer product of these two matrices, 

XAt+i{XAtf = Mdmg{l-Agk{at^i))XAtAfX^ 

Since we clearly have that 1 — 4:gk(at-i) > no matter what value &t-i takes (due to the 
definition of gk), we have that this matrix is the product of positive semi-definite matrices, and 
therefore is itself positive semi-definite. Therefore the diagonal elements are all non-negative, and 
we have proved the claim. □ 
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This result allows us to analyze our approximation to the logistic function as though we were 
using the forwards Euler method to integrate the differential equation (18), since all the steps for 
any particular unit will be in the same direction. 

Lemma B.2. As long as each step k^^\XAt\ < r < 1 (where the inequality is element-wise), then 
< (Tt < 1, Vt (i.e., the approximate logistic values will remain between and 1). 

Proof. Suppose that the step is positive for all units and 0"^ < 1, then: 

c^t+i - cFt< Tatil - af) <l- at , 

so we also have that at+i < 1. Likewise for units which are involved in a negative step, if they are 
greater than 0, then they remain so into the next iteration by an argument which is symmetric to 
the one above. Therefore we have that our logistic values never leave the interval (0, 1). □ 

With this we also have that < 4^fc((Tt) < 1 for all t, from the definition of g and g^- Substitution 
into (38), yields that: 

\\XAt+i\\2 < \\M\\2 ||diag(l-45fc(<Tt_i))||2 \\XAt\\2 < \\XAt\\2 , (39) 

since M has eigenvalues which are each either or 1. This shows that the magnitude of the 
steps for the individual units is shrinking towards zero. Therefore we conclude that eventually, our 
approximations of the logistic values stop updating. If we assume that X has d linearly independent 
columns, then this also implies that is going towards zero, and therefore our algorithm eventually 
converges. 

B.2 Quality of the Logistic Approximation 

We now analyze the error in the approximation of the logistic function values. We then use this 
together with the convexity of the problem to yield a bound on the error in the convergent pa- 
rameters (see (23)). To aid the notation, in this section we consider the problem of estimating the 
logistic values for just a single case, and specifically one for which the steps are all positive. Due 
to the symmetry of the logistic function about 0, we will then have the same type of bounds on 
the error when the approximation updates in the negative direction. We first show a loose upper 
bound on the supremum of the error which would be encountered if the approximation was run for 
an infinite number of steps of size at most r, and then use this to bound the error after finitely 
many such steps. 

As we have shown by the above monotonicity argument, our approximation to the logistic 
function is essentially analogous to using Euler's method to integrate the derivative of the logistic 
function. Since we consider approximating a single value, we change the names of our variables to 
avoid confusion with the previous vector valued approximation. If we denote by st the approximated 
value after t steps of various sizes, tq . . . Tt-i < r. Thus st st = (T{at) where at = X]i=o ^® 
compare this approximation to the exact values and consider the error: 

it = St - St . 
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Making use of the step (19), we evaluate the error in the next iteration: 



6+1 — ^t+i — st+i 

= st + Ttg{st) - St- Ttg{st) - 2~^Tta" {■)\a* 
= it + Tt[g{st)-9{st)] + Qt 
= it + Tt{st - st)g'{-)\s* + Ct 

= Ct{l + n - 2Ttst) + Ct 

Where we have defined 

C, = -2-iW(.)L. (40) 

and at < < at+i is some value in the interval about which the second derivative is taken. 
Likewise s* is bounded between st and st- As we have seen from (39), as long as < r < 1 then 
< S( < 1 for all t. Since we only consider positive steps > then we have that < < 1, 
and hence the same bound applies to s*. Therefore we have that: 

l6+i| < 161 + 101 

Therefore we see that: 

oo 

sup 161 <^ 161 (41) 
* i=l 

Examining the form of (t"(-) , we find it to be a function which is everywhere negative. Examining 
the third derivative, we find that the second derivative has exactly one stationary point in [0, oo) 
which is located at: 

6-^/l2 6 + ^/l2 

a = — log =, a[a) = 

6 + \/T2' ^ ' 12 

Whats more, we see that d^a{-) < on [0,a*), and d^a{-) > on (a*,oo). Therefore we have 
that a* is the minimum of the function. Using this we bound the sum (41) by an integral: 

- y"f^"(-)la* < - / (y"{a) da - 2Ta"{x*) = 4"! - 2ra"(a;*) . 
Substituting this into (41) and (40) we have that: 

max \it\ < 2-V(4-i - 2Ta"{x*)) = ct + dr^ ^ ct . (42) 
We can make the approximation arbitrarily tight by decreasing the step size. 
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Figure 1: Log Likelihood vs iteration number for protocol 1 with L = 100, and that of the "Hessian 
Lower bound" algorithm, which is the same as protocol 1 except with exact sigmoid evaluations. 
We also compare to the full newton raphson method, which inverts the hessian on each iteration. 
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Figure 2: Log Likelihood vs iteration number for protocol 1 with L = 500, and that of the "Hessian 
Lower bound" algorithm, which is the same as protocol 1 except with exact sigmoid evaluations. 
We also compare to the full newton raphson method, which inverts the hessian on each iteration. 
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Figure 3: Log Likelihood vs iteration number for protocol 2 with k = 5, 10, and that of the "Hessian 
Lower bound" algorithm, which is the same as protocol 1 except with exact sigmoid evaluations. 
We also compare to the full newton raphson method, which inverts the hessian on each iteration. 



26 



